Abstract



Fast nearest neighbor searching is becom- 
ing an increasingly important tool in solv- 
ing many large-scale problems. Recently 
a number of approaches to learning data- 
dependent hash functions have been devel- 
oped. In this work, we propose a column 
generation based method for learning data- 
dependent hash functions on the basis of 
proximity comparison information. Given a 
set of triplets that encode the pairwise prox- 
imity comparison information, our method 
learns hash functions that preserve the rel- 
ative comparison relationships in the data 
as well as possible within the large-margin 
learning framework. The learning procedure 
is implemented using column generation and 
hence is named CGHash. At each iteration 
of the column generation procedure, the best 
hash function is selected. Unlike most other 
hashing methods, our method generalizes to 
new data points naturally; and has a train- 
ing objective which is convex, thus ensur- 
ing that the global optimum can be identi- 
fied. Experiments demonstrate that the pro- 
posed method learns compact binary codes 
and that its retrieval performance compares 
favorably with state-of-the-art methods when 
tested on a few benchmark datasets. 



* indicates equal contributions. 



Proceedings of the 30th International Conference on Machine Learning, Atlanta, Georgia, USA, 2013. JMLR: W&CP volume 28. Copyright 2013 by the author(s).



1. Introduction 

The explosive growth in the volume of data to be pro- 
cessed in applications such as web search and mul- 
timedia retrieval increasingly demands fast similarity 
search and efficient data indexing/storage techniques. 
Considerable effort has been spent on designing hash- 
ing methods which address both the issues of fast 
similarity search and efficient data storage (for example,
Andoni & Indyk, 2006; Weiss et al., 2008; Zhang
et al., 2010b; Norouzi & Fleet, 2011; Kulis & Darrell,
2009; Gong et al., 2012). A hashing-based approach
constructs a set of hash functions that map high- 
dimensional data samples to low-dimensional binary 
codes. These binary codes can be easily loaded into 
the memory in order to allow rapid retrieval of data 
samples. Moreover, the pairwise Hamming distance 
between these binary codes can be efficiently computed 
by using bit operations, which are well supported by 
modern processors, thus enabling efficient similarity 
calculation on large-scale datasets. Hash-based ap- 
proaches have thus found a wide range of applica- 
tions, including object recognition (Torralba et al., 
2008), information retrieval (Zhang et al., 2010b), lo- 
cal descriptor compression (Strecha et al., 2011), im- 
age matching (Korman & Avidan, 2011), and many 
more. Recently a number of effective hashing meth- 
ods have been developed which construct a variety of 
hash functions, mainly on the assumption that seman- 
tically similar data samples should have similar binary 
codes, such as random projection-based locality sensi- 
tive hashing (LSH) (Andoni & Indyk, 2006), boost- 
ing learning-based similarity sensitive coding (SSC) 
(Shakhnarovich et al., 2003), and spectral hashing of 
Weiss et al. (2008) which is inspired by Laplacian 
eigenmap. 
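The bit-operation argument above can be made concrete with a minimal sketch. It assumes each binary code is packed into a single integer; the helper names `pack_bits` and `hamming_distance` are illustrative and do not come from the paper.

```python
def pack_bits(bits):
    """Pack a sequence of 0/1 values into one Python integer (first bit is most significant)."""
    code = 0
    for bit in bits:
        code = (code << 1) | (bit & 1)
    return code


def hamming_distance(code_a, code_b):
    """XOR marks the positions where the codes differ; counting the set bits gives the distance."""
    return bin(code_a ^ code_b).count("1")


a = pack_bits([1, 0, 1, 1, 0, 0, 1, 0])
b = pack_bits([1, 1, 1, 0, 0, 0, 1, 1])
print(hamming_distance(a, b))  # 3
```

On hardware with a population-count instruction the same XOR-then-count pattern typically costs a few cycles per machine word, which is why compact codes make large-scale comparison cheap.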

In more detail, spectral hashing (Weiss et al., 2008) optimizes a graph Laplacian based objective function such that the local neighborhood structure of the original dataset is best preserved in the learned low-dimensional binary space. SSC (Shakhnarovich et al., 2003) makes use of boosting to adaptively learn an embedding of the original space, represented by a set of weak learners or hash functions. This embedding aims to preserve the pairwise affinity relationships of training duplets (i.e., pairs of samples in the original space). These approaches have demonstrated that, in general, data-dependent hashing is superior to data-independent hashing, of which LSH (Andoni & Indyk, 2006) is a typical example.

Following this vein, here we learn hash functions using side information that is generally presented as a set of triplet-based constraints. Note that the triplets used for training can be generated in either a supervised or an unsupervised fashion. The fundamental idea is to learn optimal hash functions such that, when using the learned weighted Hamming distance, the relative distance comparisons of the form "point $\mathbf{x}$ is closer to $\mathbf{x}^+$ than to $\mathbf{x}^-$" are satisfied as well as possible ($\mathbf{x}^+$ and $\mathbf{x}^-$ are respectively relevant and irrelevant samples to $\mathbf{x}$). This type of relative proximity comparison has been successfully applied to learn quadratic distance metrics (Schultz & Joachims, 2004; Shen et al., 2012). Usually such proximity relationships do not require explicit class labels and thus are easier to obtain than either the class labels or the actual distances between data points. For instance, in content based image retrieval, to collect feedback, users may be asked to report whether image $\mathbf{x}$ looks more similar to $\mathbf{x}^+$ than it does to a third image $\mathbf{x}^-$. This task is typically much easier than labeling each individual image. Formally, we are given a set $\mathcal{S} = \{(\mathbf{x}_i, \mathbf{x}_i^+, \mathbf{x}_i^-) \mid d(\mathbf{x}_i, \mathbf{x}_i^+) < d(\mathbf{x}_i, \mathbf{x}_i^-)\}$, $i = 1, 2, \ldots$, where $d(\cdot, \cdot)$ is some distance or dissimilarity measure (e.g., Euclidean distance in the original space, or a semantic measure provided by a user). As explained, one may not explicitly know $d(\cdot, \cdot)$; instead, one may only be able to provide sparse proximity relationships. Using such a set of constraints, we formulate a learning problem in the large-margin framework. By using a convex surrogate loss function, we obtain a convex optimization problem, which however has an exponentially large number of variables. Column generation is thus employed to efficiently and optimally solve the formulated optimization problem.

The main contribution of this work is to propose a novel hash function learning framework which has the following desirable properties. (i) The formulated optimization problem can be globally optimized. We show that column generation can be used to iteratively find the optimal hash functions. The weights of all the selected hash functions for calculating the weighted Hamming distance are updated at each iteration. (ii) The proposed framework is flexible and can accommodate various types of constraints. We show how to learn hash functions based on proximity comparisons. Furthermore, the framework can accommodate different types of loss functions as well as regularization terms. Also, our hashing framework can use different types of hash functions such as linear functions, decision stumps/trees, RBF kernel functions, etc.

Related work Loosely speaking, hashing methods may be categorized into two groups: data-independent and data-dependent. Without using any training data, data-independent hashing methods usually generate a set of hash functions using randomization. For instance, LSH (Andoni & Indyk, 2006) uses random projection and thresholding to generate binary codes in the Hamming space, where data samples that are mutually close in the Euclidean space are likely to have similar binary codes. Recently, Kulis & Grauman (2009) proposed a kernelized version of LSH, which is capable of capturing the intrinsic relationships between data samples using kernels instead of linear inner products. In terms of learning methodology, data-dependent hashing methods can make use of unsupervised, supervised or semi-supervised learning techniques to learn a set of hash functions that generate the compact binary codes. As for unsupervised learning, two typical approaches are used to obtain such compact binary codes: thresholding the real-valued low-dimensional vectors (after dimensionality reduction) and direct optimization of a Hamming distance based objective function (e.g., spectral hashing (Weiss et al., 2008) and self-taught hashing (Zhang et al., 2010b)). The spectral hashing (SPH) method directly optimizes a graph Laplacian objective function in the Hamming space. Inspired by SPH, Zhang et al. (2010b) developed the self-taught hashing (STH) method. In the first step of STH, Laplacian graph embedding is used to generate a sequence of binary codes for each sample. By viewing these binary codes as binary classification labels, a set of hash functions is obtained by training a set of bit-specific linear support vector machines. Liu et al. (2011) proposed a scalable graph-based hashing method which uses a small anchor graph to approximate the original neighborhood graph and alleviates the computational limitation of spectral hashing.
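The data-independent baseline described above (random projection followed by thresholding) can be sketched in a few lines. This is a simplified illustration rather than the exact scheme of Andoni & Indyk (2006): the Gaussian directions, the uniform offsets, and the names `make_lsh_functions` and `lsh_code` are assumptions made for the example.

```python
import random


def make_lsh_functions(dim, num_bits, seed=0):
    """Draw one random direction v and offset b per bit (the sampling scheme is an assumption)."""
    rng = random.Random(seed)
    return [
        ([rng.gauss(0.0, 1.0) for _ in range(dim)], rng.uniform(-1.0, 1.0))
        for _ in range(num_bits)
    ]


def lsh_code(x, functions):
    """Project x onto each direction and threshold at zero to obtain one bit per function."""
    return [
        1 if sum(vi * xi for vi, xi in zip(v, x)) + b >= 0.0 else 0
        for v, b in functions
    ]


functions = make_lsh_functions(dim=3, num_bits=8, seed=42)
code_a = lsh_code([0.2, -1.0, 0.5], functions)
code_b = lsh_code([0.2, -1.0, 0.5], functions)
print(code_a)
```

No training data enters `make_lsh_functions`, which is precisely what distinguishes this family from the data-dependent methods discussed in the rest of the section.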

As for the supervised learning case, a number of hashing methods take advantage of labeled training samples to build data-dependent hash functions. These hashing methods often formulate hash function learning as a classification problem. For example, Salakhutdinov & Hinton (2009) proposed the restricted Boltzmann machine (RBM) hashing method using a multi-layer deep learning technique for binary code generation. Strecha et al. (2011) use Fisher linear discriminant analysis (LDA) to embed the original data samples into a lower-dimensional space, where the embedded data samples are binarized using thresholding. Boosting methods have also been employed to develop hashing methods such as SSC (Shakhnarovich et al., 2003) and Forgiving Hash (Baluja & Covell, 2008), both of which learn a set of weak learners as hash functions in the boosting framework. It is demonstrated in (Torralba et al., 2008) that some data-dependent hashing methods like stacked RBM and boosting SSC perform much better than LSH on large-scale databases of millions of images. Wang et al. (2012) proposed a semi-supervised hashing method, which aims to ensure the smoothness of similar data samples and the separability of dissimilar data samples. More recently, Liu et al. (2012) introduced a kernel-based supervised hashing method, where the hashing functions are nonlinear kernel functions.

The closest work to ours might be boosting based SSC hashing (Shakhnarovich et al., 2003), which also learns a set of weighted hash functions through boosting learning. Ours differs from SSC in the learning procedure. The resulting optimization problem of our CGHash is based on the concept of margin maximization. We have derived a meaningful Lagrange dual problem such that column generation can be applied to solve the semi-infinite optimization problem. In contrast, SSC is built on the learning procedure of AdaBoost, which employs stage-wise coordinate-descent optimization. The weights associated with selected hash functions (corresponding to weak classifiers in AdaBoost) are not fully updated at each iteration. The information used for training is also different: we use distance comparison information, whereas SSC uses pairwise information. In addition, our work can accommodate various types of constraints, and can flexibly adapt to different types of loss functions as well as regularization terms. It is unclear, for example, how SSC can accommodate different types of regularization that may encode useful prior information. In this sense our CGHash is much more flexible. Next, we present our main results.

2. The proposed algorithm 

Given a set of training samples $\mathbf{x}_m \in \mathbb{R}^D$ ($m = 1, 2, \ldots$), we aim to learn a set of hash functions $h_j(\mathbf{x}) \in \mathcal{H}$, $j = 1, 2, \ldots, \ell$, for mapping these training samples to a low-dimensional binary space, described by a set of binary codewords $\mathbf{b}_m$ ($m = 1, 2, \ldots$). Here each $\mathbf{b}_m$ is an $\ell$-dimensional binary vector. In the low-dimensional binary space, the codewords $\mathbf{b}_m$ are supposed to preserve the underlying proximity information of the corresponding $\mathbf{x}_m$ in the original high-dimensional space. Next we learn such hash functions $\{h_j(\mathbf{x})\}_{j=1}^{\ell}$ within the large-margin learning framework.

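Formally, suppose that we are given a set of triplets $\{(\mathbf{x}_i, \mathbf{x}_i^+, \mathbf{x}_i^-)\}_{i=1}^{|\mathcal{I}|}$ with $\mathbf{x}_i, \mathbf{x}_i^+, \mathbf{x}_i^- \in \mathbb{R}^D$ and $\mathcal{I}$ being the triplet index set. These triplets encode the proximity comparison information such that the distance/dissimilarity between $\mathbf{x}_i$ and $\mathbf{x}_i^+$ is smaller than that between $\mathbf{x}_i$ and $\mathbf{x}_i^-$. Now we need to define the weighted Hamming distance for the learned binary codes: $d_{\mathcal{H}}(\mathbf{x}, \mathbf{z}) = \sum_{j=1}^{\ell} w_j |h_j(\mathbf{x}) - h_j(\mathbf{z})|$, where $w_j$ is a non-negative weight factor associated with the $j$-th hash function. In our experiments, we have generated the triplet set such that $\mathbf{x}_i$ and $\mathbf{x}_i^+$ belong to the same class while $\mathbf{x}_i$ and $\mathbf{x}_i^-$ belong to different classes. As discussed, these triplets may be sparsely provided by users in applications such as image retrieval. So we want the constraints $d_{\mathcal{H}}(\mathbf{x}_i, \mathbf{x}_i^+) < d_{\mathcal{H}}(\mathbf{x}_i, \mathbf{x}_i^-)$ to be satisfied as well as possible. For notational simplicity, we define

$$a_i^{[j]} = |h_j(\mathbf{x}_i) - h_j(\mathbf{x}_i^-)| - |h_j(\mathbf{x}_i) - h_j(\mathbf{x}_i^+)|$$

and $d_{\mathcal{H}}(\mathbf{x}_i, \mathbf{x}_i^-) - d_{\mathcal{H}}(\mathbf{x}_i, \mathbf{x}_i^+) = \mathbf{w}^\top \mathbf{a}_i$ with

$$\mathbf{a}_i = [a_i^{[1]}, a_i^{[2]}, \ldots, a_i^{[\ell]}]^\top. \qquad (1)$$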
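The identity between the distance gap and $\mathbf{w}^\top \mathbf{a}_i$ in (1) can be checked numerically. The sketch below uses made-up 3-bit codes and weights; the function names are illustrative only.

```python
def weighted_hamming(w, code_x, code_z):
    """d_H(x, z) = sum_j w_j * |h_j(x) - h_j(z)| for two codes given as lists of bits."""
    return sum(wj * abs(hx - hz) for wj, hx, hz in zip(w, code_x, code_z))


def triplet_vector(code_x, code_pos, code_neg):
    """a_i^{[j]} = |h_j(x) - h_j(x^-)| - |h_j(x) - h_j(x^+)| for every bit j."""
    return [abs(hx - hn) - abs(hx - hp) for hx, hp, hn in zip(code_x, code_pos, code_neg)]


w = [0.5, 1.0, 2.0]          # hypothetical non-negative weights
x, x_pos, x_neg = [1, 0, 1], [1, 0, 0], [0, 1, 1]

a = triplet_vector(x, x_pos, x_neg)
gap = weighted_hamming(w, x, x_neg) - weighted_hamming(w, x, x_pos)
dot = sum(wj * aj for wj, aj in zip(w, a))
print(a, gap, dot)  # [1, 1, -1] -0.5 -0.5
```

In this toy case the gap is negative, so the triplet constraint is violated under the chosen weights; learning $\mathbf{w}$ amounts to pushing such inner products above zero.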

In what follows, we describe the details of our hashing 
algorithm using different types of convex loss functions 
and regularization norms. In theory, any convex loss 
and regularization can be used in our hashing frame- 
work. More details of our hashing algorithm can be 
found in Algorithm 1 and the supplementary file (Li 
et al., 2013). 

2.1. Learning hashing functions with the hinge 
loss 

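Hashing with $\ell_1$ norm regularization Using the hinge loss, we define the following large-margin optimization problem:

$$\min_{\mathbf{w}, \boldsymbol{\xi}} \; \sum_{i=1}^{|\mathcal{I}|} \xi_i + C \|\mathbf{w}\|_1 \quad \text{s.t.} \;\; \mathbf{w} \succcurlyeq 0, \; \boldsymbol{\xi} \succcurlyeq 0; \qquad (2)$$
$$d_{\mathcal{H}}(\mathbf{x}_i, \mathbf{x}_i^-) - d_{\mathcal{H}}(\mathbf{x}_i, \mathbf{x}_i^+) \geq 1 - \xi_i, \; \forall i,$$

where $\|\cdot\|_1$ is the 1-norm, $\mathbf{w} = (w_1, w_2, \ldots, w_\ell)^\top$ is the weight vector; $\boldsymbol{\xi}$ is the slack variable; $C$ is a parameter controlling the trade-off between the training error and model capacity, and the symbol '$\succcurlyeq$' indicates element-wise inequalities. The optimization problem (2) can be rewritten as:

$$\min_{\mathbf{w}, \boldsymbol{\xi}} \; \mathbf{1}^\top \boldsymbol{\xi} + C \mathbf{1}^\top \mathbf{w} \qquad (3)$$
$$\text{s.t.} \;\; \mathbf{w} \succcurlyeq 0; \; \mathbf{a}_i^\top \mathbf{w} \geq 1 - \xi_i, \; \xi_i \geq 0, \; \forall i,$$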
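where $\mathbf{1}$ is the all-one column vector. The corresponding dual problem is:

$$\max_{\mathbf{u}} \; \mathbf{1}^\top \mathbf{u}, \quad \text{s.t.} \;\; \mathbf{A}\mathbf{u} \preccurlyeq C\mathbf{1}, \; 0 \preccurlyeq \mathbf{u} \preccurlyeq \mathbf{1}, \qquad (4)$$

where the matrix $\mathbf{A} = (\mathbf{a}_1, \mathbf{a}_2, \ldots, \mathbf{a}_{|\mathcal{I}|}) \in \mathbb{R}^{\ell \times |\mathcal{I}|}$ and the symbol '$\preccurlyeq$' indicates element-wise inequalities.

Learning Hash Functions Using Column Generation

Algorithm 1 Hashing using column generation

Input: Training triplets $\{(\mathbf{x}_i, \mathbf{x}_i^+, \mathbf{x}_i^-)\}$, $i = 1, 2, \ldots$, and $\ell$, the number of hash functions.
Output: Learned hash functions $\{h_j(\mathbf{x})\}_{j=1}^{\ell}$ and the associated weights $\mathbf{w}$.
Initialize: $\mathbf{u} \leftarrow \frac{1}{|\mathcal{I}|}$.
for $j = 1$ to $\ell$ do
  1. Find the best hash function $h_j(\cdot)$ by solving the subproblem (11);
  2. Add $h_j(\cdot)$ to the hash function set;
  3. Update $\mathbf{a}_i$, $\forall i$, as in (1);
  4. Solve the primal problem for $\mathbf{w}$ (using L-BFGS-B (Zhu et al., 1997)) and obtain the dual variable $\mathbf{u}$ using KKT condition (10).
end for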

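Hashing with $\ell_\infty$ norm regularization The primal problem is formulated as:

$$\min_{\mathbf{w}, \boldsymbol{\xi}} \; \sum_{i=1}^{|\mathcal{I}|} \xi_i + C \|\mathbf{w}\|_\infty$$
$$\text{s.t.} \;\; \mathbf{w} \succcurlyeq 0, \; \mathbf{a}_i^\top \mathbf{w} \geq 1 - \xi_i, \; \xi_i \geq 0, \; \forall i.$$

We can make the regularization a constraint,

$$\min_{\mathbf{w}, \boldsymbol{\xi}} \; \sum_{i=1}^{|\mathcal{I}|} \xi_i$$
$$\text{s.t.} \;\; \mathbf{w} \succcurlyeq 0, \; \|\mathbf{w}\|_\infty \leq C'; \; \mathbf{a}_i^\top \mathbf{w} \geq 1 - \xi_i, \; \xi_i \geq 0, \; \forall i. \qquad (5)$$

The dual form of the above optimization problem is:

$$\min_{\mathbf{u}, \mathbf{q}} \; -\mathbf{1}^\top \mathbf{u} + C' \mathbf{1}^\top \mathbf{q} \qquad (6)$$
$$\text{s.t.} \;\; \mathbf{A}\mathbf{u} \preccurlyeq \mathbf{q}, \; 0 \preccurlyeq \mathbf{u} \preccurlyeq \mathbf{1},$$

where $C'$ is a positive constant.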

2.2. Hashing with a general convex loss 
function 

Here we derive the algorithm for learning hash functions with a general convex loss. We assume that the general convex loss function $f(\cdot)$ is smooth (exponential, logistic, squared hinge loss, etc.), although our algorithm can be easily extended to non-smooth loss functions.

Hashing with $\ell_1$ norm regularization Assume that we want to find a set of hash functions such that the set of constraints $d_{\mathcal{H}}(\mathbf{x}_i, \mathbf{x}_i^-) - d_{\mathcal{H}}(\mathbf{x}_i, \mathbf{x}_i^+) = \mathbf{w}^\top \mathbf{a}_i > 0$, $i = 1, 2, \ldots$, hold as well as possible. These constraints do not all have to be strictly satisfied. We define the margin $\rho_i = \mathbf{w}^\top \mathbf{a}_i$, and we want to maximize the margin with regularization. Using $\ell_1$ norm regularization to control the capacity, we may define the primal optimization problem as:

$$\min_{\mathbf{w}, \boldsymbol{\rho}} \; \sum_{i=1}^{|\mathcal{I}|} f(\rho_i) + C \|\mathbf{w}\|_1, \quad \text{s.t.} \;\; \mathbf{w} \succcurlyeq 0; \; \rho_i = \mathbf{a}_i^\top \mathbf{w}, \; \forall i. \qquad (7)$$

Here $f(\cdot)$ is a smooth convex loss function; $\mathbf{w} = [w_1, w_2, \ldots, w_\ell]^\top$ is the weight vector that we are interested in optimizing. $C$ is a parameter controlling the trade-off between the training error and model capacity.

Without this regularization, one can always make $\mathbf{w}$ arbitrarily large to make the convex loss approach zero when all constraints are satisfied. Because the number of possible hash functions can be extremely large or even infinite, we are not able to solve problem (7) directly. Instead, we use the column generation technique to iteratively and approximately solve the original problem. Column generation is a technique originally used for large-scale linear programming problems. Demiriz et al. (2002) used this method to design boosting algorithms. At each iteration, one column (a variable in the primal, or equivalently a constraint in the dual problem) is added to the restricted problem. Once no column violating the constraint in the dual can be found, the solution of the restricted problem is identical to the optimal solution of the original problem. Here we only need to obtain an approximate solution and, in order to learn compact codes, we only care about the first few (e.g., 60) selected hash functions. In theory, if we run column generation for a sufficient number of iterations, we can obtain a sufficiently accurate solution (up to a preset precision, or until no more hash functions can be found to improve the solution).
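The overall loop of Algorithm 1 can be sketched as follows. This is a minimal illustration under several assumptions that are not the authors' setup: decision stumps serve as the hash function family (enumerated exhaustively), the loss is the squared hinge $f(\rho) = \max(0, 1 - \rho)^2$, and the restricted primal is solved by plain projected gradient descent as a stand-in for L-BFGS-B. All function names are hypothetical.

```python
def stump_candidates(points):
    """Enumerate decision stumps (feature index, threshold) at midpoints between sorted values."""
    stumps = []
    for d in range(len(points[0])):
        values = sorted({p[d] for p in points})
        stumps.extend((d, (lo + hi) / 2.0) for lo, hi in zip(values, values[1:]))
    return stumps


def column(stump, points, triplets):
    """Entries a_i^{[j]} of Eq. (1) for one stump, with h(x) = +1 if x[d] > t else -1."""
    d, t = stump
    h = [1.0 if p[d] > t else -1.0 for p in points]
    return [abs(h[i] - h[neg]) - abs(h[i] - h[pos]) for i, pos, neg in triplets]


def margins(w, columns):
    """rho_i = a_i^T w for every triplet i."""
    return [sum(wj * col[i] for wj, col in zip(w, columns)) for i in range(len(columns[0]))]


def solve_restricted_primal(columns, C, iters=2000):
    """Projected gradient descent on sum_i max(0, 1 - rho_i)^2 + C * sum_j w_j with w >= 0."""
    lipschitz = 2.0 * sum(a * a for col in columns for a in col)  # upper bound on curvature
    w = [0.0] * len(columns)
    for _ in range(iters):
        slack = [max(0.0, 1.0 - r) for r in margins(w, columns)]
        grad = [C - 2.0 * sum(s * a for s, a in zip(slack, col)) for col in columns]
        w = [max(0.0, wj - g / lipschitz) for wj, g in zip(w, grad)]
    return w


def cghash_sketch(points, triplets, num_bits, C=0.01):
    candidates = stump_candidates(points)
    u = [1.0 / len(triplets)] * len(triplets)  # uniform initial dual variables
    chosen, columns, w = [], [], []
    for _ in range(num_bits):
        remaining = [s for s in candidates if s not in chosen]
        if not remaining:
            break
        # Subproblem (11): pick the stump whose dual constraint is most violated.
        scores = [sum(ui * ai for ui, ai in zip(u, column(s, points, triplets))) for s in remaining]
        best_score = max(scores)
        if best_score <= C:  # no constraint of the dual (9) is violated, so stop early
            break
        best = remaining[scores.index(best_score)]
        chosen.append(best)
        columns.append(column(best, points, triplets))
        w = solve_restricted_primal(columns, C)
        # KKT condition (10) for the squared hinge loss: u_i = -f'(rho_i).
        u = [2.0 * max(0.0, 1.0 - r) for r in margins(w, columns)]
    return chosen, w


points = [[0.0], [0.5], [1.0], [5.0], [5.5], [6.0]]
labels = [0, 0, 0, 1, 1, 1]
triplets = [
    (i, p, n)
    for i in range(6) for p in range(6) for n in range(6)
    if p != i and labels[p] == labels[i] and labels[n] != labels[i]
]
chosen, w = cghash_sketch(points, triplets, num_bits=4)
print(chosen, w)
```

On this toy one-dimensional dataset a single stump separates the two classes, so the loop stops after one column even though four bits were requested; this is the early-termination behaviour described in the text, where no remaining column violates the dual constraint.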

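We need to derive a meaningful Lagrange dual in order to use column generation. The Lagrangian is:

$$L = \sum_{i=1}^{|\mathcal{I}|} f(\rho_i) + C\mathbf{1}^\top \mathbf{w} - \mathbf{p}^\top \mathbf{w} + \sum_{i=1}^{|\mathcal{I}|} u_i (\mathbf{a}_i^\top \mathbf{w} - \rho_i)$$
$$= C\mathbf{1}^\top \mathbf{w} - \mathbf{p}^\top \mathbf{w} + \sum_{i=1}^{|\mathcal{I}|} u_i \mathbf{a}_i^\top \mathbf{w} - \Big(\mathbf{u}^\top \boldsymbol{\rho} - \sum_{i=1}^{|\mathcal{I}|} f(\rho_i)\Big),$$

where $\mathbf{p} \succcurlyeq 0$ and $\mathbf{u}$ are Lagrange multipliers. With the definition of the Fenchel conjugate (Boyd & Vandenberghe, 2004), we have the following relation:

$$\inf_{\mathbf{w}, \boldsymbol{\rho}} L = -\sup_{\boldsymbol{\rho}} \Big(\mathbf{u}^\top \boldsymbol{\rho} - \sum_{i=1}^{|\mathcal{I}|} f(\rho_i)\Big) = -\sum_{i=1}^{|\mathcal{I}|} f^*(u_i),$$

and in order to have a finite infimum, $C\mathbf{1} - \mathbf{p} + \mathbf{A}\mathbf{u} = 0$ must hold. So we have $\mathbf{p} \succcurlyeq 0$, $\mathbf{A}\mathbf{u} \succcurlyeq -C\mathbf{1}$. Here the matrix $\mathbf{A}$ is defined in (4).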

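Consequently, the corresponding dual problem of (7) can be written as:

$$\min_{\mathbf{u}} \; \sum_{i=1}^{|\mathcal{I}|} f^*(u_i), \quad \text{s.t.} \;\; \mathbf{A}\mathbf{u} \succcurlyeq -C\mathbf{1}. \qquad (8)$$

Here $f^*(\cdot)$ is the Fenchel conjugate of $f(\cdot)$. By reversing the sign of $\mathbf{u}$, we can reformulate (8) as its equivalent form:

$$\min_{\mathbf{u}} \; \sum_{i=1}^{|\mathcal{I}|} f^*(-u_i), \quad \text{s.t.} \;\; \mathbf{A}\mathbf{u} \preccurlyeq C\mathbf{1}. \qquad (9)$$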
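As a worked instance of the conjugate appearing in (8) and (9), consider the squared hinge loss. The specific form $f(\rho) = \max(0, 1 - \rho)^2$ is assumed here for illustration; the exact scaling used by the authors is given in their supplementary file.

```latex
% Squared hinge loss (assumed form) and its Fenchel conjugate
f(\rho) = \max(0,\, 1 - \rho)^2,
\qquad
f^*(u) = \sup_{\rho}\,\bigl\{ u\rho - \max(0,\, 1-\rho)^2 \bigr\}.

% For u > 0 the supremum is unbounded (let \rho \to \infty).
% For u \le 0 the maximizer lies at \rho = 1 + u/2 \le 1, which gives
f^*(u) =
\begin{cases}
  u + \dfrac{u^2}{4}, & u \le 0, \\[4pt]
  +\infty,            & u > 0.
\end{cases}

% Substituting into (9), where the sign of u has been reversed:
\min_{\mathbf{u}} \; \sum_{i=1}^{|\mathcal{I}|} \Bigl( \frac{u_i^2}{4} - u_i \Bigr),
\quad \text{s.t.} \;\; \mathbf{A}\mathbf{u} \preccurlyeq C\mathbf{1}, \;\; \mathbf{u} \succcurlyeq 0.
```

The constraint $\mathbf{u} \succcurlyeq 0$ is implicit in (9) for this loss, because $f^*(-u_i)$ is infinite whenever $u_i < 0$.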

Since we assume that $f(\cdot)$ is smooth, the Karush-Kuhn-Tucker (KKT) condition establishes the connection between (9) and (7) at optimality:

$$u_i^* = -f'(\rho_i^*), \; \forall i. \qquad (10)$$

In other words, the dual variable is determined by the gradient of the loss function in the primal. So if we solve the primal problem (7), from the primal solution $\mathbf{w}^*$ we can calculate the dual solution $\mathbf{u}^*$ using (10). The converse may not hold: the dual solution does not in general determine the primal solution.
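The KKT condition (10) translates directly into code. The sketch below assumes the standard forms of the three smooth losses named in this section (exponential $e^{-\rho}$, logistic $\log(1 + e^{-\rho})$, squared hinge $\max(0, 1 - \rho)^2$); the dictionary and function names are illustrative.

```python
import math

# Derivatives f'(rho) of the assumed loss forms.
LOSS_DERIVATIVES = {
    "exponential": lambda rho: -math.exp(-rho),               # f = exp(-rho)
    "logistic": lambda rho: -1.0 / (1.0 + math.exp(rho)),     # f = log(1 + exp(-rho))
    "squared_hinge": lambda rho: -2.0 * max(0.0, 1.0 - rho),  # f = max(0, 1 - rho)^2
}


def dual_from_margins(margins, loss="squared_hinge"):
    """Apply Eq. (10): u_i = -f'(rho_i) for each triplet margin rho_i."""
    f_prime = LOSS_DERIVATIVES[loss]
    return [-f_prime(rho) for rho in margins]


u = dual_from_margins([-1.0, 0.5, 2.0])
print(u)  # [4.0, 1.0, 0.0]
```

Triplets whose margin already exceeds 1 receive zero dual weight under the squared hinge loss, so only violated or barely satisfied triplets influence the choice of the next hash function.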

The core idea of column generation is to generate a small subset of variables, each of which is sequentially found by selecting the most violated dual constraint in the dual optimization problem (9). This process is equivalent to inserting several primal variables into the primal optimization problem (7). Here, the subproblem for generating the most violated dual constraint (i.e., finding the best hash function) can be defined as:

$$h^*(\cdot) = \arg\max_{h(\cdot) \in \mathcal{H}} \; \sum_{i=1}^{|\mathcal{I}|} u_i \bigl( |h(\mathbf{x}_i) - h(\mathbf{x}_i^-)| - |h(\mathbf{x}_i) - h(\mathbf{x}_i^+)| \bigr). \qquad (11)$$

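In order to obtain a smoothly differentiable objective function, we reformulate (11) into the following equivalent form:

$$\arg\max_{h(\cdot) \in \mathcal{H}} \; \sum_{i=1}^{|\mathcal{I}|} u_i \bigl[ (h(\mathbf{x}_i) - h(\mathbf{x}_i^-))^2 - (h(\mathbf{x}_i) - h(\mathbf{x}_i^+))^2 \bigr]. \qquad (12)$$

The equivalence between (11) and (12) can be trivially established: for a binary-valued hash function, e.g. $h(\cdot) \in \{-1, +1\}$, each difference $h(\mathbf{x}) - h(\mathbf{z})$ lies in $\{-2, 0, 2\}$, so $(h(\mathbf{x}) - h(\mathbf{z}))^2 = 2|h(\mathbf{x}) - h(\mathbf{z})|$ and the two objectives differ only by a constant factor.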

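To globally solve the optimization problem (12) is in general difficult. In the case of decision stumps as hash functions, we can usually exhaustively enumerate all the possibilities and find the globally best one. In the case of linear perceptrons as hash functions, $h(\mathbf{x})$ takes the form $\mathrm{sgn}(\mathbf{v}^\top \mathbf{x} + b)$, where $\mathrm{sgn}(\cdot)$ is the sign function. As a result, the binary hash codes are easily computed by $(1 + h(\mathbf{x}))/2$. In practice, we relax $h(\mathbf{x}) = \mathrm{sgn}(\mathbf{v}^\top \mathbf{x} + b)$ to $h(\mathbf{x}) = \tanh(\mathbf{v}^\top \mathbf{x} + b)$, with $\tanh(\cdot)$ being the hyperbolic tangent function. For notational simplicity, let $r_{i-}$ and $r_{i+}$ denote $\tanh(\mathbf{v}^\top \mathbf{x}_i + b) - \tanh(\mathbf{v}^\top \mathbf{x}_i^- + b)$ and $\tanh(\mathbf{v}^\top \mathbf{x}_i + b) - \tanh(\mathbf{v}^\top \mathbf{x}_i^+ + b)$, respectively. Then we have the following optimization problem:

$$h^*(\cdot) = \arg\max_{\mathbf{v}, b} \; \sum_{i=1}^{|\mathcal{I}|} u_i \bigl[ (h(\mathbf{x}_i) - h(\mathbf{x}_i^-))^2 - (h(\mathbf{x}_i) - h(\mathbf{x}_i^+))^2 \bigr]$$
$$= \arg\max_{\mathbf{v}, b} \; \sum_{i=1}^{|\mathcal{I}|} u_i \bigl( r_{i-}^2 - r_{i+}^2 \bigr). \qquad (13)$$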

The above optimization problem can be efficiently solved by using L-BFGS (Zhu et al., 1997) after feature normalization. The initialization of L-BFGS can be guided by LSH (Andoni & Indyk, 2006). Namely, we first generate a set of candidate samples such that $\mathbf{v} \sim N(0, 1)$ and $b \sim U(-1, 1)$, with $N(\cdot)$ and $U(\cdot)$ respectively being the normal and uniform distributions. Then, we use the candidate that maximizes the objective function (13) as the initialization. In our experiments, we have used linear perceptrons as hash functions.
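The LSH-guided initialization just described can be sketched as a random search over candidate $(\mathbf{v}, b)$ pairs scored by the relaxed objective (13). The number of candidates, the toy data, and the function names below are assumptions for illustration; the subsequent L-BFGS refinement is omitted.

```python
import math
import random


def relaxed_objective(v, b, points, triplets, u):
    """Objective (13) with h(x) = tanh(v^T x + b); triplets hold (anchor, positive, negative) indices."""
    h = [math.tanh(sum(vi * xi for vi, xi in zip(v, x)) + b) for x in points]
    return sum(
        ui * ((h[i] - h[neg]) ** 2 - (h[i] - h[pos]) ** 2)
        for ui, (i, pos, neg) in zip(u, triplets)
    )


def lsh_guided_init(points, triplets, u, num_candidates=100, seed=0):
    """Sample v ~ N(0, 1) per coordinate and b ~ U(-1, 1); keep the best-scoring candidate."""
    rng = random.Random(seed)
    dim = len(points[0])
    best_score, best_v, best_b = -math.inf, None, None
    for _ in range(num_candidates):
        v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        b = rng.uniform(-1.0, 1.0)
        score = relaxed_objective(v, b, points, triplets, u)
        if score > best_score:
            best_score, best_v, best_b = score, v, b
    return best_score, best_v, best_b


points = [[0.0, 0.1], [0.2, 0.0], [1.0, 1.1], [0.9, 1.0]]
triplets = [(0, 1, 2), (1, 0, 3), (2, 3, 0), (3, 2, 1)]
u = [0.25] * len(triplets)
score, v0, b0 = lsh_guided_init(points, triplets, u)
print(score, v0, b0)
```

The returned `(v0, b0)` would then serve as the starting point for a gradient-based optimizer on (13); starting from the best of many random draws reduces the chance of beginning in a flat region of the tanh relaxation.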

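Hashing with $\ell_\infty$ norm regularization We show here that we can also use other regularization terms such as the $\ell_\infty$ norm. With the $\ell_\infty$ norm regularization, the primal problem is defined as:

$$\min_{\mathbf{w}, \boldsymbol{\rho}} \; \sum_{i=1}^{|\mathcal{I}|} f(\rho_i) + C \|\mathbf{w}\|_\infty, \quad \text{s.t.} \;\; \mathbf{w} \succcurlyeq 0; \; \rho_i = \mathbf{a}_i^\top \mathbf{w}, \; \forall i. \qquad (14)$$

This optimization problem is equivalent to:

$$\min_{\mathbf{w}, \boldsymbol{\rho}} \; \sum_{i=1}^{|\mathcal{I}|} f(\rho_i), \quad \text{s.t.} \;\; \|\mathbf{w}\|_\infty \leq C'; \; \mathbf{w} \succcurlyeq 0; \; \rho_i = \mathbf{a}_i^\top \mathbf{w}, \; \forall i, \qquad (15)$$

where $C'$ is a properly selected constant, related to $C$ in (14). Due to $\mathbf{w} \succcurlyeq 0$ and $\|\mathbf{w}\|_\infty \leq C'$, we obtain $0 \preccurlyeq \mathbf{w} \preccurlyeq C'\mathbf{1}$. Therefore, the Lagrangian can be written as:

$$L = \sum_{i=1}^{|\mathcal{I}|} f(\rho_i) - \mathbf{p}^\top \mathbf{w} + \mathbf{q}^\top (\mathbf{w} - C'\mathbf{1}) + \sum_{i=1}^{|\mathcal{I}|} u_i (\mathbf{a}_i^\top \mathbf{w} - \rho_i),$$

where $\mathbf{p}$, $\mathbf{q}$, $\mathbf{u}$ are Lagrange multipliers. Similar to the $\ell_1$ norm case, we can easily derive the dual problem as:

$$\min_{\mathbf{u}, \mathbf{q}} \; \sum_{i=1}^{|\mathcal{I}|} f^*(u_i) + C' \mathbf{1}^\top \mathbf{q}, \quad \text{s.t.} \;\; \mathbf{A}\mathbf{u} \succcurlyeq -\mathbf{q}. \qquad (16)$$

By reversing the sign of $\mathbf{u}$, we can reformulate (16) as its equivalent form:

$$\min_{\mathbf{u}, \mathbf{q}} \; \sum_{i=1}^{|\mathcal{I}|} f^*(-u_i) + C' \mathbf{1}^\top \mathbf{q}, \quad \text{s.t.} \;\; \mathbf{A}\mathbf{u} \preccurlyeq \mathbf{q}. \qquad (17)$$

The KKT condition in this $\ell_\infty$ regularized case is the same as (10). Also the rule to generate the best hash function (i.e., the most violated constraint in (17)) remains the same as in the $\ell_1$ norm case that we have discussed. Note that both the primal problems (7) and (15) can be efficiently solved using quasi-Newton methods such as L-BFGS-B (Zhu et al., 1997) by eliminating the auxiliary variable $\boldsymbol{\rho}$.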

Figure 1. The retrieval and classification performances of the proposed CGHash and 10 other hashing methods on the ISOLET dataset. The left plot shows the average precision-recall performances using 60 bits. The middle plot shows the average performances using different code lengths measured as the proportion of the true nearest neighbors with top-50 retrieval. The right plot shows the average 3-nearest-neighbor classification performances using different code lengths.

Figure 2. The retrieval and classification performances of the proposed CGHash and 10 other hashing methods on the SCENE-15 dataset. The description of each plot is the same as in Fig. 1.



Extension To demonstrate the flexibility of the proposed framework, we show an example that considers additional pairwise information. Assume that we are given a set of duplets whose two samples are known to be neighbors of each other or to come from the same class, so the distance between the two samples of each duplet should be minimized. We can easily include such a term in our objective function. Formally, let us denote the duplet set as $\mathcal{D} = \{(\mathbf{x}_k, \mathbf{x}_k^+)\}$; we want to minimize the divergence $\sum_k d_{\mathcal{H}}(\mathbf{x}_k, \mathbf{x}_k^+) = \sum_j w_j \bigl( \sum_k |h_j(\mathbf{x}_k) - h_j(\mathbf{x}_k^+)| \bigr) = \sum_j s_j w_j$, with $s_j = \sum_k |h_j(\mathbf{x}_k) - h_j(\mathbf{x}_k^+)|$ being a nonnegative constant given $h_j(\cdot)$. If we use this term to replace the $\ell_1$ regularization term $\sum_j w_j$ in the primal (7), all of our analysis still holds and Algorithm 1 is still applicable with minimal modification, because the new term can simply be seen as a weighted $\ell_1$ norm.
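The rearrangement of the duplet term into a weighted $\ell_1$ norm can be verified numerically. In the sketch below the codes, weights, and duplet index pairs are made-up values, and the function names are illustrative.

```python
def duplet_costs(codes, duplets):
    """s_j = sum over duplets (a, b) of |h_j(x_a) - h_j(x_b)|, one entry per hash function j."""
    num_bits = len(codes[0])
    return [sum(abs(codes[a][j] - codes[b][j]) for a, b in duplets) for j in range(num_bits)]


def weighted_hamming(w, code_x, code_z):
    """d_H(x, z) = sum_j w_j * |h_j(x) - h_j(z)|."""
    return sum(wj * abs(hx - hz) for wj, hx, hz in zip(w, code_x, code_z))


codes = [[1, 0, 1], [1, 1, 1], [0, 0, 1]]   # binary codes of three samples
duplets = [(0, 1), (0, 2)]                  # index pairs assumed to be neighbors
w = [0.5, 2.0, 1.0]

s = duplet_costs(codes, duplets)
weighted_l1 = sum(sj * wj for sj, wj in zip(s, w))
direct = sum(weighted_hamming(w, codes[a], codes[b]) for a, b in duplets)
print(s, weighted_l1, direct)  # [1, 1, 0] 2.5 2.5
```

Because each $s_j$ depends only on the already selected hash functions, it can be precomputed once per column and used in place of the unit coefficient of $w_j$ in the regularizer.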

3. Experimental results 

Experimental setup In order to evaluate the proposed column generation hashing method (referred to as CGHash), we have conducted a set of experiments on six benchmark datasets. To train data-dependent hash functions, each dataset is randomly split into a training subset and a testing subset. This training/testing split is repeated 5 times, and the average performance over these 5 trials is reported here.

In the experiments, the proposed hashing method is implemented using the squared hinge loss function with $\ell_1$ norm regularization (as shown in the supplementary file). Moreover, the triplets used for learning hash functions are generated in the same way as in (Weinberger et al., 2006). Specifically, given a training sample, we select the K nearest neighbors from its associated same-label training samples as relevant samples, and then choose the K nearest neighbors from its associated different-label training samples as irrelevant samples (K = 30 for the SCENE-15 dataset and K = 10 for the other datasets). The trade-off control factor C is cross-validated. We found that, over a wide range, the trade-off control factor C does not have a significant impact on the performance.
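The triplet generation procedure above can be sketched as follows. The text does not spell out how relevant and irrelevant neighbors are combined, so the sketch assumes that every one of the K same-label neighbors is paired with every one of the K different-label neighbors, and that neighbors are ranked by Euclidean distance; both choices and the function names are assumptions.

```python
def nearest(candidates, anchor, points, k):
    """Return the k candidate indices closest to the anchor in squared Euclidean distance."""
    def sq_dist(j):
        return sum((a - b) ** 2 for a, b in zip(points[anchor], points[j]))
    return sorted(candidates, key=sq_dist)[:k]


def generate_triplets(points, labels, k):
    """Build (anchor, relevant, irrelevant) index triplets from labeled training samples."""
    triplets = []
    for i in range(len(points)):
        same = [j for j in range(len(points)) if j != i and labels[j] == labels[i]]
        diff = [j for j in range(len(points)) if labels[j] != labels[i]]
        for pos in nearest(same, i, points, k):
            for neg in nearest(diff, i, points, k):
                triplets.append((i, pos, neg))
    return triplets


points = [[0.0], [0.5], [1.0], [5.0], [5.5], [6.0]]
labels = [0, 0, 0, 1, 1, 1]
triplets = generate_triplets(points, labels, k=1)
print(triplets)
```

Under this pairing assumption each training sample contributes up to K squared triplets, so K controls both the locality of the constraints and the size of the optimization problem.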

Competing methods To demonstrate the effectiveness of the proposed hashing method (CGHash), we compare it quantitatively with several other state-of-the-art hashing methods. For simplicity, they are respectively referred to as LSH (Locality Sensitive Hashing (Andoni & Indyk, 2006)), SSC (Supervised Similarity Sensitive Coding (Torralba et al., 2008) as a modified version of (Shakhnarovich et al., 2003)), LSI (Latent Semantic Indexing (Deerwester et al., 1990)), LCH (Laplacian Co-Hashing (Zhang et al., 2010a)), SPH (Spectral Hashing (Weiss et al., 2008)), STH (Self-Taught Hashing (Zhang et al., 2010b)), AGH (Anchor Graph Hashing (Liu et al., 2011)), BREs (Supervised Binary Reconstructive Embedding (Kulis & Darrell, 2009)), SPLH (Semi-Supervised Learning Hashing (Wang et al., 2012)), and ITQ (Iterative Quantization (Gong et al., 2012)). Comparing with the above competing methods verifies the effect of learning hash functions and shows the performance differences among hashing methods.

Figure 3. The retrieval and classification performances of the proposed CGHash and 10 other hashing methods on the MNIST dataset. The description of each plot is the same as in Fig. 1.

Figure 4. The retrieval and classification performances of the proposed CGHash and 10 other hashing methods on a subset of the LABELME dataset. The description of each plot is the same as in the previous figures.

Evaluation criteria For a quantitative performance comparison, we introduce the following three evaluation criteria: i) precision-recall curve; ii) proportion of true neighbors in top-k retrieval; and iii) K-nearest-neighbor classification. In the experiments, the aforementioned retrieval performance scores are averaged over all test queries in the dataset. For i), the precision-recall curve is computed as follows: $\text{precision} = \frac{\#\text{retrieved relevant samples}}{\#\text{all retrieved samples}}$ and $\text{recall} = \frac{\#\text{retrieved relevant samples}}{\#\text{all relevant samples}}$. For ii), the proportion of true neighbors in top-k retrieval is calculated as $\frac{\#\text{retrieved true neighbors}}{k}$. For iii), each test sample is classified by a majority voting in K-nearest-neighbor classification.

Quantitative comparison results Figs. 1-6 show the retrieval and classification performances of all the hashing methods using different code lengths on the six datasets. In each of these figures, we report quantitative comparison results of all the hashing methods in the following three aspects: 1) the average precision-recall performances using the maximum code length, and the average precisions together with standard deviations (as shown in the legend of each figure); 2) the average performances using different code lengths in the proportion of the true nearest neighbors with top-50 retrieval, and the average proportion results together with their standard deviations in the case of the maximum code length (as shown in the legend of each figure); and 3) the average K-nearest-neighbor classification performances using different code lengths, and the average classification results together with their standard deviations in the case of the maximum code length (as shown in the legend of each figure).
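The three evaluation criteria defined above can be computed per query as in the following sketch. It assumes that a query's retrieval result is a list of database indices ranked by Hamming distance; the function names and toy values are illustrative.

```python
from collections import Counter


def precision_recall(retrieved, relevant):
    """Criterion i): precision and recall of one retrieved set against the relevant set."""
    hits = len(set(retrieved) & set(relevant))
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


def true_neighbor_proportion(ranked, true_neighbors, k):
    """Criterion ii): fraction of the top-k retrieved items that are true nearest neighbors."""
    return len(set(ranked[:k]) & set(true_neighbors)) / k


def knn_classify(ranked, labels, K):
    """Criterion iii): majority vote over the labels of the K nearest retrieved items."""
    votes = Counter(labels[j] for j in ranked[:K])
    return votes.most_common(1)[0][0]


ranked = [3, 1, 4, 7]                        # database indices sorted by Hamming distance
labels = {3: "a", 1: "b", 4: "b", 7: "a"}    # hypothetical class labels
print(precision_recall(ranked, [1, 4, 9]))            # (0.5, 0.666...)
print(true_neighbor_proportion(ranked, [1, 3, 8], 2)) # 1.0
print(knn_classify(ranked, labels, 3))                # b
```

Sweeping the Hamming radius (or the number of retrieved items) and averaging `precision_recall` over all test queries traces out the precision-recall curves reported in the figures.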

From Figs. 1-6, we see that the proposed CGHash obtains larger areas under the precision-recall curves than the competing hashing methods. In addition, CGHash achieves higher proportions of true nearest neighbors with top-50 retrieval in most cases. Moreover, CGHash has lower classification errors than the competing methods in most cases.

Fig. 7 shows the retrieval and classification perfor- 
mances of the proposed CGHash using different val- 
ues of K on the SCENE-15 dataset. It is seen from 
Fig. 7 that in general the performance is improved as 
K increases. 

In addition, Fig. 8 shows two retrieval examples on the MNIST and LABELME datasets. From Fig. 8, we observe that CGHash obtains visually accurate nearest-neighbor search results.

Figure 5. The retrieval and classification performances of the proposed CGHash and 10 other hashing methods on the USPS dataset. The description of each plot is the same as in the previous figures.

Figure 6. The retrieval and classification performances of the proposed CGHash and 10 other hashing methods on the PASCAL07 dataset. The description of each plot is the same as in the previous figures.

Figure 7. The retrieval and classification performances of the proposed CGHash using different values of K (K ∈ {3, 10, 20, 30}) on the SCENE-15 dataset. The left plot shows the average precision-recall performances using 60 bits. The middle plot displays the average performances using different code lengths measured as the proportion of the true nearest neighbors with top-50 retrieval. The right plot shows the average 3-nearest-neighbor classification performances using different code lengths.

Figure 8. Two retrieval examples for CGHash on the LABELME and MNIST datasets. The left part shows query samples while the right part displays the first few nearest neighbors obtained using CGHash.

Conclusion We have proposed a novel hashing method that is implemented using column generation-based convex optimization. By taking into account a set of constraints on the triplet-based relative ranking, the proposed hashing method is capable of learning compact hash codes. This set of constraints is incorporated into the large-margin learning framework. Hash functions are then learned iteratively using column generation. Experimental results on several datasets have shown that the proposed hashing method achieves improved performance compared with state-of-the-art hashing methods in nearest-neighbor classification, precision-recall, and proportion of true nearest neighbors retrieved.

This work is in part supported by ARC grants LP120200485 and FT120100969.






References 

Andoni, A. and Indyk, P. Near-optimal hashing al- 
gorithms for approximate nearest neighbor in high 
dimensions. In Proc. IEEE Symp. Foundations of 
Computer Science, pp. 459-468, 2006. 

Baluja, S. and Covell, M. Learning to hash: forgiving 
hash functions and applications. Data Mining & 
Knowledge Discovery, 17(3):402-430, 2008. 

Boyd, S. and Vandenberghe, L. Convex Optimization. 
Cambridge University Press, 2004. 

Deerwester, S., Dumais, S.T., Furnas, G.W., Lan- 
dauer, T.K., and Harshman, R. Indexing by latent 
semantic analysis. J. American Society for Informa- 
tion Science, 41(6):391-407, 1990. 

Demiriz, A., Bennett, K.P., and Shawe-Taylor, J. Linear programming boosting via column generation. Machine Learning, 46(1):225-254, 2002.

Gong, Y., Lazebnik, S., Gordo, A., and Perronnin, F. 
Iterative quantization: a procrustean approach to 
learning binary codes for large-scale image retrieval. 
IEEE Trans. Pattern Analysis & Machine Intelli- 
gence, 2012. 

Korman, S. and Avidan, S. Coherency sensitive hash- 
ing. In Proc. Int. Conf. Computer Vision, pp. 1607- 
1614, 2011. 

Kulis, B. and Darrell, T. Learning to hash with binary 
reconstructive embeddings. In Proc. Adv. Neural 
Information Process. Systems, 2009. 

Kulis, B. and Grauman, K. Kernelized locality- 
sensitive hashing for scalable image search. In Proc. 
Int. Conf. Computer Vision, pp. 2130-2137, 2009. 

Li, X., Lin, G., Shen, C., van den Hengel, A., and Dick, A. Supplementary document: Effectively learning hash functions using column generation, available at: http://cs.adelaide.edu.au/~chhshen/paper.html, 2013.

Liu, W., Wang, J., Kumar, S., and Chang, S. F. Hash- 
ing with graphs. In Proc. Int. Conf. Machine Learn- 
ing, 2011. 

Liu, W., Wang, J., Ji, R., Jiang, Y.G., and Chang, 
S.F. Supervised hashing with kernels. In Proc. 
IEEE Conf. Computer Vision & Pattern Recogni- 
tion, 2012. 

Norouzi, M. and Fleet, D.J. Minimal loss hashing for 
compact binary codes. In Proc. Int. Conf. Machine 
Learning, 2011. 



Salakhutdinov, R. and Hinton, G. Semantic hash- 
ing. Int. J. Approximate Reasoning, 50(7):969-978, 
2009. 

Schultz, M. and Joachims, T. Learning a distance met- 
ric from relative comparisons. In Proc. Adv. Neural 
Information Processing Systems, 2004. 

Shakhnarovich, G., Viola, P., and Darrell, T. Fast 
pose estimation with parameter-sensitive hashing. 
In Proc. Int. Conf. Computer Vision, pp. 750-757, 
2003. 

Shen, C., Kim, J., Wang, L., and van den Hengel, A. Positive semidefinite metric learning using boosting-like algorithms. J. Machine Learning Research, 13:1007-1036, 2012.

Strecha, C., Bronstein, A. M., Bronstein, M. M., and Fua, P. LDAHash: Improved matching with smaller descriptors. IEEE Trans. Pattern Analysis & Machine Intelligence, 2011.

Torralba, A., Fergus, R., and Weiss, Y. Small codes 
and large image databases for recognition. In Proc. 
IEEE Conf. Computer Vision & Pattern Recogni- 
tion, pp. 1-8, 2008. 

Wang, J., Kumar, S., and Chang, S.F. Semi- 
supervised hashing for large scale search. IEEE 
Trans. Pattern Analysis & Machine Intelligence, 
2012. 

Weinberger, K.Q., Blitzer, J., and Saul, L.K. Distance 
metric learning for large margin nearest neighbor 
classification. In Proc. Adv. Neural Information Pro- 
cessing Systems, 2006. 

Weiss, Y., Torralba, A., and Fergus, R. Spectral hash- 
ing. In Proc. Adv. Neural Information Process. Sys- 
tems, 2008. 

Zhang, D., Wang, J., Cai, D., and Lu, J. Laplacian 
co-hashing of terms and documents. In Proc. Eur. 
Conf. Information Retrieval, pp. 577-580, 2010a. 

Zhang, D., Wang, J., Cai, D., and Lu, J. Self-taught hashing for fast similarity search. In Proc. ACM SIGIR Conf., pp. 18-25, 2010b.

Zhu, C., Byrd, R. H., Lu, P., and Nocedal, J. Algorithm 778: L-BFGS-B: Fortran subroutines for large-scale bound-constrained optimization. ACM Trans. Math. Softw., 23(4):550-560, 1997.



